An Open Linguistic Infrastructure for Annotated Corpora

Author

  • Nancy Ide
Abstract

Annotated corpora are a fundamental resource for research and development in natural language processing (NLP). Although unannotated corpora (for example, Gigaword and Wikipedia) are often used to build language models, annotations for linguistic phenomena provide a richer set of features and hence potentially better models in the long run. It is widely accepted that a first step in the pursuit of NLP applications for any language is to develop a high-quality annotated corpus with at least a basic set of annotations for phenomena such as part of speech and shallow syntax. Meanwhile, corpora for languages such as English, for which substantial annotated resources already exist, are increasingly being enhanced with additional annotations for semantic and discourse phenomena (e.g., semantic roles, sense annotations, coreference, named entities, discourse structure). This is occurring for at least two reasons: first, more and deeper linguistic information, together with study of intra-level interactions, may lead to insights that can improve NLP applications; and second, in order to handle more subtle and difficult aspects of language understanding, there is a trend away from purely statistical approaches and (back) toward symbolic or rule-based approaches. Richly annotated corpora provide the raw materials for this kind of development. As a result, there is an increased demand for high-quality linguistic annotations of corpora representing a wide range of phenomena, especially at the semantic level, to support machine learning and computational linguistics research in general. At the same time, there is a demand for annotated corpora representing a broad range of genres, because domain affects both syntactic and semantic characteristics. Finally, there is a keen awareness of the need for annotated corpora that are both easily accessible and available for use by anyone.

Similar resources

MultiMASC: An Open Linguistic Infrastructure for Language Research

This paper describes MultiMASC, which builds upon the Manually Annotated Sub-Corpus (MASC) (Ide et al., 2008; Ide et al., 2010) project, a community-based collaborative effort to create, annotate, and validate linguistic data and annotations on broad-genre open language data. MultiMASC will extend MASC to include comparable corpora in other languages that not only represent the same genres an...

Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual Resources and Technology Infrastructure

Multilingual technologies, which to a large extent are language independent, provide a powerful support for easier building of annotated linguistic resources for languages where such resources are scarce or missing. All these technologies require parallel corpora in order to achieve their ends. Parallel texts encode extremely valuable linguistic knowledge because the linguistic decisions made b...

Linguistically Annotated Learner Corpora: Aspects of a Layered Linguistic Encoding and Standardized Representation

Linguistically annotated corpora that are stored in standardized digital form can be a valuable source of empirical insight. They can help verify linguistic generalizations and support the formulation of new hypotheses. The linguistic annotation of such corpora often is crucial for their effective exploration from a linguistic perspective. The annotation essentially serves as an index to the li...

graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora

We present graphANNIS, a fast implementation of the established query language AQL for dealing with deeply annotated linguistic corpora. AQL builds on a graph-based abstraction for modeling and exchanging linguistic data, yet all its current implementations use relational databases as the storage layer. In contrast, graphANNIS directly implements the ANNIS graph data model in main memory. We show th...

Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora

Large and open multiparallel corpora are a valuable resource for contrastive corpus linguists if the data is annotated and stored in a way that allows precise and flexible ad hoc searches. A linguistic query language should also support computational linguists in automated multilingual data mining. We review a broad range of approaches for linguistic query and reporting languages according to u...

Publication date: 2013